Personal Loan Campaign Case Study

Background & Context

AllLife Bank is a US bank that has a growing customer base. The majority of these customers are liability customers (depositors) with varying sizes of deposits. The number of customers who are also borrowers (asset customers) is quite small, and the bank is interested in expanding this base rapidly to bring in more loan business and in the process, earn more through the interest on loans. In particular, the management wants to explore ways of converting its liability customers to personal loan customers (while retaining them as depositors).

A campaign that the bank ran last year for liability customers showed a healthy conversion rate of over 9%. This has encouraged the retail marketing department to devise campaigns with better-targeted marketing to increase the success ratio.

Problem Statement

We have to build a model that will help the marketing department identify potential customers who have a higher probability of purchasing the loan.

Objective

1. To predict whether a liability customer will buy a personal loan or not.

2. To identify which variables are most significant.

3. To identify which segment of customers should be targeted more.

Data Dictionary

This dataset contains AllLife Bank's liability customer data.

Loading Libraries

Load Dataset

View the first and last 5 rows of the dataset

Understand the shape of the dataset
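The loading and inspection steps above can be sketched as follows; a tiny synthetic frame stands in for the actual CSV, whose filename is not given here:

```python
import pandas as pd

# In the notebook this would be something like:
# df = pd.read_csv("<path-to-loan-data>.csv")   # actual filename not shown
# A small synthetic frame stands in so the snippet runs on its own.
df = pd.DataFrame({
    "Age": [25, 45, 39, 35, 35],
    "Income": [49, 34, 11, 100, 45],
    "Personal_Loan": [0, 0, 0, 0, 1],
})

print(df.head())   # first 5 rows
print(df.tail())   # last 5 rows
print(df.shape)    # (number of rows, number of columns)
```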

Observations:

Let us check for null values and duplicates
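The null and duplicate checks can be sketched as (toy frame; the third row deliberately duplicates the first):

```python
import pandas as pd

# Synthetic stand-in; the third row duplicates the first.
df = pd.DataFrame({"Age": [25, None, 25], "Income": [49, 34, 49]})

print(df.isnull().sum())        # null count per column
print(df.duplicated().sum())    # count of fully duplicated rows
```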

Observations

Check the data types of the columns for the dataset

Observations:

Summary of the dataset

Observations:

Let us look at the different columns for missing data

Observations:

Missing value Treatment

Observations:

Check the individual age groups to look for any patterns in the negative values
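A sketch of the check, plus one common treatment. Taking absolute values is an assumption about what the notebook does; negative experience entries such as -1 are most plausibly data-entry sign errors:

```python
import pandas as pd

# Synthetic stand-in with a few negative Experience entries.
df = pd.DataFrame({"Age": [23, 24, 23, 25], "Experience": [-1, -2, 1, 2]})

# Group the negative values by age to look for a pattern.
neg = df[df["Experience"] < 0]
print(neg.groupby("Age")["Experience"].count())

# One common treatment (an assumption, not necessarily the notebook's choice):
# take absolute values, reading -1 as a sign error for 1 year.
df["Experience"] = df["Experience"].abs()
```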

Observations:

Datatype Conversions

Observations:

Data Preprocessing - Feature Engineering

Let us map Zipcodes to Counties and group the data
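The mapping step can be sketched with `Series.map` and a lookup dictionary. The dictionary here is hypothetical; in practice the lookup would come from a full ZIP-to-county reference table:

```python
import pandas as pd

df = pd.DataFrame({"ZIPCode": [94720, 90089, 99999]})

# Hypothetical lookup -- in practice this comes from a full
# ZIP-to-county reference table (e.g. a census crosswalk file).
zip_to_county = {94720: "Alameda", 90089: "Los Angeles"}

# ZIP codes missing from the lookup become NaN.
df["ZIPCode_County"] = df["ZIPCode"].map(zip_to_county)
print(df.groupby("ZIPCode_County", dropna=False).size())
```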

Observations:

Let us replace the None values in the County column with 'Unknown'

Convert the data type of ZIPCode_County to category

Observations:

Let us bin Age, Experience, Income, CCAvg and Mortgage into ranges for better EDA plotting
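Binning can be done with `pd.cut`; the edges and labels below are illustrative, and the notebook's actual ranges may differ:

```python
import pandas as pd

df = pd.DataFrame({"Age": [23, 34, 45, 56, 64]})

# Bin edges and labels are illustrative; the notebook's ranges may differ.
df["Age_bin"] = pd.cut(
    df["Age"],
    bins=[20, 30, 40, 50, 60, 70],
    labels=["20-30", "30-40", "40-50", "50-60", "60-70"],
)
print(df["Age_bin"].value_counts().sort_index())
```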

Observations:

Let us rename the Education column's numeric values to their respective named buckets
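The renaming can be sketched with `Series.map`; the label names are assumed from a typical data dictionary for this dataset:

```python
import pandas as pd

df = pd.DataFrame({"Education": [1, 2, 3, 1]})

# Label names assumed from a typical data dictionary for this dataset.
education_labels = {1: "Undergraduate", 2: "Graduate", 3: "Advanced/Professional"}
df["Education"] = df["Education"].map(education_labels).astype("category")
print(df["Education"].value_counts())
```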

Observations:

Univariate Analysis

Observations:

Observations:

Observations:

Observations:

Observations:

Observations:

Observations:

Observations:

Observations:

Observations:

Observations:

Observations:

Observations:

Observations:

Observations:

Observations:

Observations:

Observations:

Bivariate Analysis

Let us plot the heatmap
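The heatmap is drawn from the correlation matrix. A sketch with synthetic columns standing in for the bank data (Age and Experience made perfectly linear to mimic their strong correlation); the seaborn call is shown as a comment:

```python
import pandas as pd

# Synthetic numeric columns standing in for the bank data; Age and
# Experience are made perfectly linear to mimic their strong correlation.
df = pd.DataFrame({
    "Age": [25, 35, 45, 55],
    "Experience": [1, 11, 21, 31],
    "Income": [49, 80, 60, 120],
})

corr = df.corr()
print(corr)

# To draw the heatmap itself (assuming seaborn is available):
# import seaborn as sns
# sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
```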

Observations:

Let us plot the pair plot to see the correlations between variables

Observations:

Let us plot stacked bars for the categorical variables

Personal_Loan vs Family
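The stacked-bar comparison can be sketched with a row-normalized crosstab (toy data; the real proportions come from the bank dataset):

```python
import pandas as pd

df = pd.DataFrame({
    "Family": [1, 1, 2, 2, 3, 3, 4, 4],
    "Personal_Loan": [0, 0, 0, 1, 0, 1, 1, 1],
})

# Share of loan takers within each family size.
ct = pd.crosstab(df["Family"], df["Personal_Loan"], normalize="index")
print(ct)

# Stacked bar (assuming matplotlib is available):
# ct.plot(kind="bar", stacked=True)
```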

Observations:

Personal_Loan vs Education

Observations:

Personal_Loan vs Age

Observations:

Personal_Loan vs Experience

Observations:

Personal_Loan vs Income

Observations:

Personal_Loan vs CCAvg

Observations:

Observations:

Observations:

Observations:

Observations:

Observations:

Observations:

Observations:

Observations:

Data Pre-Processing

Drop the ID and ZIPCode columns as they have no statistical significance

Outlier Detection Using Boxplots

Observations:

Let's treat using capping method and check again.

Outlier Treatment
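A minimal capping sketch using the 1.5*IQR whisker rule; this is a common capping choice, but the notebook may use percentile bounds instead:

```python
import pandas as pd

def cap_outliers(s: pd.Series) -> pd.Series:
    """Cap values outside the 1.5*IQR whiskers. The whisker rule is a
    common capping choice; the notebook may use percentile bounds instead."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return s.clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

s = pd.Series([1, 2, 3, 4, 100])
capped = cap_outliers(s)
print(capped.max())   # the 100 is pulled down to the upper whisker
```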

Let us check the outlier treatment

Observations:

Creating a function to split, encode and add a constant to X
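A pandas-only sketch of such a helper; the notebook likely uses sklearn's `train_test_split` and statsmodels' `add_constant` for the same steps:

```python
import pandas as pd

def prepare_X_y(df, target="Personal_Loan", test_frac=0.3, seed=1):
    """One-hot encode categoricals, add an intercept column, and split.
    A pandas-only sketch: the notebook likely uses sklearn's
    train_test_split and statsmodels' add_constant for the same steps."""
    X = pd.get_dummies(df.drop(columns=[target]), drop_first=True)
    X.insert(0, "const", 1.0)   # intercept term for statsmodels-style fitting
    y = df[target]
    test_idx = df.sample(frac=test_frac, random_state=seed).index
    return (X.drop(index=test_idx), X.loc[test_idx],
            y.drop(index=test_idx), y.loc[test_idx])

# Toy frame standing in for the processed bank data.
df = pd.DataFrame({
    "Income": [49, 34, 11, 100, 45, 29, 72, 22, 81, 180],
    "Education": pd.Categorical([1, 1, 1, 2, 2, 2, 3, 3, 1, 3]),
    "Personal_Loan": [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
})
X_train, X_test, y_train, y_test = prepare_X_y(df)
print(X_train.columns.tolist())
```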

Split the data for Model Building

Function to calculate the metric score

Function to draw the confusion matrix
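The two helpers can be sketched as follows; the notebook's versions may also print a formatted table or draw the matrix with seaborn:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

def metric_scores(y_true, y_pred):
    """Classification metrics used throughout the notebook (a sketch;
    the notebook's helper may also print a formatted table)."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
    }

y_true = np.array([0, 0, 1, 1, 1, 0])
y_pred = np.array([0, 1, 1, 1, 0, 0])
print(metric_scores(y_true, y_pred))
print(confusion_matrix(y_true, y_pred))   # rows = actual, columns = predicted
```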

Build the Logistic Regression Model

Multicollinearity assumption check and removal

Check for data correlation
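VIF is usually computed with statsmodels' `variance_inflation_factor`; the equivalent calculation is shown below with numpy only, on synthetic data with a deliberately collinear Age/Experience pair:

```python
import numpy as np
import pandas as pd

def vif(X: pd.DataFrame) -> pd.Series:
    """Variance inflation factor per column: 1 / (1 - R^2) when the column
    is regressed on all the others. Equivalent to statsmodels'
    variance_inflation_factor, computed here with numpy."""
    out = {}
    for col in X.columns:
        y = X[col].to_numpy(dtype=float)
        A = X.drop(columns=[col]).to_numpy(dtype=float)
        A = np.column_stack([np.ones(len(A)), A])   # intercept
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        r2 = 1 - resid.var() / y.var()
        out[col] = 1.0 / (1.0 - r2) if r2 < 1 else np.inf
    return pd.Series(out)

rng = np.random.default_rng(0)
age = rng.uniform(23, 65, 200)
X = pd.DataFrame({
    "Age": age,
    "Experience": age - 22 + rng.normal(0, 0.5, 200),  # nearly collinear
    "Income": rng.uniform(10, 200, 200),
})
print(vif(X))   # Age and Experience should show very large VIFs
```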

Since Age and Experience are highly correlated, they have high VIF scores; we will drop Age and check the VIF and the model performance. The Mortgage p-value is also high, so we will drop that as well.
Since the p-values of almost all the ZIPCode_County variables are greater than 0.05, we will try dropping that column from the data and recheck the model performance.

Now all the VIF scores are in the acceptable range of less than 3. Multicollinearity has been resolved.

Dropping either Age or Experience alone does not have much effect on the p-values, so we drop both and check the model performance.
Now only Family_2 has a high p-value; we drop that alone and check the model performance.

Metrics of final model 'lg5'

ROC-AUC

Coefficient interpretations

Converting coefficients to odds

Odds from coefficients
Percentage change in odds
Coefficient interpretations
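The conversion is exp(coefficient) for the odds multiplier and (exp(coefficient) - 1) * 100 for the percentage change. A sketch with illustrative coefficient values, not the notebook's fitted ones:

```python
import numpy as np
import pandas as pd

# Illustrative coefficients -- NOT the fitted values from the notebook.
coefs = pd.Series({"Income": 0.05, "CreditCard": -0.6})

odds = np.exp(coefs)            # multiplicative change in odds per unit increase
pct_change = (odds - 1) * 100   # percentage change in odds

print(odds)
print(pct_change)
```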

Model Performance Improvement

Let's see if the f1 score can be improved further, by changing the model threshold using AUC-ROC Curve.

Optimal threshold using AUC-ROC curve
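One common way to pick the threshold from the ROC curve is Youden's J statistic (maximizing TPR - FPR); whether the notebook uses exactly this rule is an assumption. Toy labels and probabilities stand in for the model's predictions:

```python
import numpy as np
from sklearn.metrics import roc_curve

# Toy labels and predicted probabilities standing in for the fitted model.
y_true = np.array([0, 0, 0, 0, 0, 1, 1, 0, 1, 1])
y_prob = np.array([0.1, 0.2, 0.25, 0.3, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9])

fpr, tpr, thresholds = roc_curve(y_true, y_prob)
# Youden's J: the threshold where TPR - FPR is largest.
optimal = thresholds[np.argmax(tpr - fpr)]
print(optimal)
```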

Let's use Precision-Recall curve and see if we can find a better threshold
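With the precision-recall curve, a natural rule is to pick the threshold that maximizes F1 (an assumption about the notebook's exact selection rule), again on toy predictions:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

y_true = np.array([0, 0, 0, 0, 0, 1, 1, 0, 1, 1])
y_prob = np.array([0.1, 0.2, 0.25, 0.3, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9])

precision, recall, thresholds = precision_recall_curve(y_true, y_prob)
# The last precision/recall pair has no matching threshold, so drop it
# before computing F1 at each candidate threshold.
f1 = (2 * precision[:-1] * recall[:-1]
      / (precision[:-1] + recall[:-1] + 1e-12))
best = thresholds[np.argmax(f1)]
print(best)
```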

Model Performance Summary

Conclusion

We have been able to build a predictive model that the bank can use to find customers who will buy a personal loan, with an f1_score of 0.80 on the training set and 0.77 on the test data (Logistic Regression with significant predictors and an optimal threshold of 0.35).

The coefficients of Income, Family size, CCAvg, Education, and CD_Account (or some of their levels) are positive; an increase in these will increase the chances of a customer buying the loan.

The coefficients of Securities_Account, Online and CreditCard are negative; an increase in these will decrease the chances of a customer buying a personal loan.

Build Decision Tree Model

Insights:

* True Positives:

Reality: A customer buys a loan.

Model predicted: The liability customer will be converted to a loan customer.

Outcome: The model is good.

* True Negatives:

Reality: A customer did NOT buy a loan.

Model predicted: The liability customer will NOT be converted to a loan customer.

Outcome: The business is unaffected.

* False Positives:

Reality: A customer did NOT buy a loan.

Model predicted: The customer will be converted to a loan customer.

Outcome: The team targeting potential customers will waste some resources on customers who will not convert, but this is a much smaller loss than missing a customer who would buy a loan.

* False Negatives:

Reality: A customer buys a loan.

Model predicted: The customer will NOT buy a loan.

Outcome: The potential customer is missed by the sales/marketing team; the team could have offered that customer a discount or loyalty benefit to encourage the purchase. (Customer retention will be affected.)

In this case, not being able to identify a potential customer is the biggest loss we can face. Hence, recall is the right metric to check the performance of the model.

Visualizing the Decision Tree

The tree above is very complex and difficult to interpret.

Reducing Overfitting

Using GridSearch for Hyperparameter tuning of our tree model
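A sketch of the grid search; the data here is a synthetic stand-in and the parameter grid is illustrative, not necessarily the notebook's exact grid:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the bank data; the parameter grid is illustrative.
X, y = make_classification(n_samples=300, n_features=6, random_state=1)

grid = GridSearchCV(
    DecisionTreeClassifier(random_state=1),
    param_grid={
        "max_depth": [3, 5, 7],
        "min_samples_leaf": [5, 10, 20],
        "criterion": ["gini", "entropy"],
    },
    scoring="recall",   # recall is the metric chosen for this problem
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_)
```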

Visualizing the Decision Tree

Cost Complexity Pruning

Total impurity of leaves vs effective alphas of the pruned tree
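The pruning path can be sketched with sklearn's `cost_complexity_pruning_path` (synthetic data again; the choice of alpha shown is illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=6, random_state=1)

tree = DecisionTreeClassifier(random_state=1)
path = tree.cost_complexity_pruning_path(X, y)
ccp_alphas, impurities = path.ccp_alphas, path.impurities

# Larger alpha -> more pruning -> higher total leaf impurity. In the
# notebook, one tree is fitted per alpha and the best chosen on recall.
pruned = DecisionTreeClassifier(random_state=1,
                                ccp_alpha=ccp_alphas[-2]).fit(X, y)
print(len(ccp_alphas), pruned.get_n_leaves())
```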

Observations:

Visualizing the Decision Tree for the best_model with the highest ccp_alpha = 0.0067

Feature Importance
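Feature importances come from the fitted tree's `feature_importances_` attribute. A sketch on synthetic columns; in the notebook the columns would be Income, CCAvg, Education, and so on:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in; in the notebook the columns are Income, CCAvg, etc.
X, y = make_classification(n_samples=300, n_features=5, n_informative=2,
                           random_state=1)
cols = [f"feature_{i}" for i in range(5)]

tree = DecisionTreeClassifier(max_depth=4, random_state=1).fit(X, y)
importance = (pd.Series(tree.feature_importances_, index=cols)
              .sort_values(ascending=False))
print(importance)   # importances sum to 1 across all features
```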

Income is the most important feature for predicting a liability customer's conversion to a loan customer.

EDA on incorrectly predicted data from best_model

Observations:

Let us build a model with ccp_alpha = 0.029 (the second-highest value) to compare the performance

Observations:

Visualizing the Decision Tree for best_model2 with ccp_alpha = 0.029

Comparing all the decision tree models

Decision tree model with post pruning has given the best recall score on the test data.

Conclusions from Decision Tree

Recommendations

According to the decision tree model: